MSPM: A modularized and scalable multi-agent reinforcement learning-based system for financial portfolio management

Financial portfolio management (PM) is one of the most applicable problems in reinforcement learning (RL) owing to its sequential decision-making nature. However, existing RL-based approaches rarely focus on scalability or reusability to adapt to the ever-changing markets. These approaches are rigid and cannot scale to accommodate the varying number of assets in portfolios and the increasing need for heterogeneous data input. Also, RL agents in the existing systems are trained ad hoc and are hardly reusable for different portfolios. To confront the above problems, a modular design is desired for the systems to be compatible with reusable asset-dedicated agents. In this paper, we propose a multi-agent RL-based system for PM (MSPM). MSPM involves two types of asynchronously-updated modules: Evolving Agent Module (EAM) and Strategic Agent Module (SAM). An EAM is an information-generating module with a Deep Q-network (DQN) agent, and it receives heterogeneous data and generates signal-comprised information for a particular asset. An SAM is a decision-making module with a Proximal Policy Optimization (PPO) agent for portfolio optimization, and it connects to multiple EAMs to reallocate the corresponding assets in a financial portfolio. Once trained, EAMs can be connected to any SAM at will, like assembling LEGO blocks. With its modularized architecture, the multi-step condensation of volatile market information, and the reusable design of EAM, MSPM simultaneously addresses the two challenges in RL-based PM: scalability and reusability. Experiments on 8-year U.S. stock market data demonstrate the effectiveness of MSPM in profit accumulation through its outperformance over five different baselines in terms of accumulated rate of return (ARR), daily rate of return (DRR), and Sortino ratio (SR). MSPM improves ARR by at least 186.5% compared to constant rebalanced portfolio (CRP), a widely-used PM strategy.
To validate the indispensability of EAM, we back-test and compare MSPMs on four different portfolios. EAM-enabled MSPMs improve ARR by at least 1341.8% compared to EAM-disabled MSPMs.


Introduction
Portfolio management (PM) is a continuous process of reallocating capital into multiple assets [1], and it aims to maximize accumulated profits with an option to minimize the overall risks of the portfolio. To perform such a practice, portfolio managers who focus on stock markets conventionally read financial statements and balance sheets, follow the news from media and announcements from financial institutions, and analyze stock price trends. Because PM is, like RL, a sequential decision-making problem, researchers naturally wish to incorporate deep reinforcement learning (DRL) methods in PM. As one of the attempts, the authors of [2] propose a PM framework for cryptocurrencies using Deep Deterministic Policy Gradient (DDPG) [3,4]. [5] proposes a method called Adversarial Training for portfolio optimization with the implementation of three different RL methods: DDPG, Proximal Policy Optimization (PPO) [6] and Policy Gradient (PG). Akin to receiving information from various sources as portfolio managers generally do, existing approaches incorporate heterogeneous data [7]. Recently, multi-agent reinforcement learning (MARL) approaches have also been proposed [8][9][10]. In [10], the authors propose MAPS, a system involving a group of Deep Q-network [11] (DQN)-based agents corresponding to individual investors, to make investment decisions and create a diversified portfolio. MAPS can be recognized as a reinforcement-learning implementation of ensemble learning [12] by its very nature. In addition, [13] proposes iRDPG to generate adaptive quantitative trading strategies by using DRL and imitation learning. However, while inspiring, the existing approaches seldom focus on scalability and reusability to accommodate the ever-changing markets. RL agents in the existing multi-agent-based systems are trained ad hoc and are rarely reusable for different portfolios.
Also, the existing systems are barely scalable enough to answer the need for a growing number of assets in portfolios and increasingly heterogeneous data input. For example, in SARL [7], the encoder's intake is either financial news data for embedding or stock prices for trading signal generation, but cannot be both, and this issue prevents the encoder from efficiently producing holistic information and eventually limits the RL-based agents' learning. Furthermore, the existing systems lack a modular design to be compatible with different RL agents for different assets. In this paper, we propose MSPM, a novel multi-agent reinforcement learning-based system, with a modularized and scalable architecture for PM. In MSPM, assets are vital and organic building blocks. This vitalness is reflected in that each asset has its dedicated module: Evolving Agent Module (EAM). An EAM takes heterogeneous data and utilizes a DQN-based agent to produce signal-comprised information. After we set up and train the EAMs corresponding to the assets in a portfolio, we connect them to a decision-making module: Strategic Agent Module (SAM). An SAM represents a portfolio and uses the profound information from the connected EAMs for asset reallocation. EAM and SAM are asynchronously updated, and the EAMs' reusability allows them to be combined and connected to multiple SAMs at discretion. With the power of parallel computing, we can perform capital reallocation for various portfolios at scale, simultaneously.
To evaluate MSPM's performance, we back-test and compare MSPM to five different baselines on two different portfolios. MSPM outperforms all the baselines in terms of accumulated rate of return, daily rate of return, and Sortino ratio. For instance, MSPM improves accumulated rate of return by 49.3% and 426.6% compared to the state-of-the-art RL-based method, Adversarial PG [5], on the two portfolios. We also inspect the position-holding of five different EAMs to exemplify the high quality and reliability of the signals generated by EAM. Specifically, the average winning rate of the EAMs in the two portfolios reaches 80%. Furthermore, we validate the necessity of EAM by back-testing and comparing the EAM-enabled and EAM-disabled MSPMs on four different portfolios. EAM-enabled MSPMs improve accumulated rate of return by at least 1341.8% compared to the EAM-disabled MSPMs. Our contributions are listed as follows:
• To the best of our knowledge, MSPM is the first approach that formalizes a modularized and scalable multi-agent reinforcement learning system using signal-comprised information for financial portfolio management.
• MSPM with its modularized and reusable design addresses the issue of ad-hoc, fixed, and inefficient model training in the existing RL-based methods.
• By experiment and comparison, we confirm that our MSPM system outperforms five different baselines under extreme market conditions of U.S. stock markets during the global pandemic, from January to December 2020.

Related work
In the early years, researchers and professionals believed that certain behaviors of price and volume would repeat periodically and consistently. Based on this recognition, technical indicators (TIs) were invented, using historical price and volume data to predict the movement of asset prices [15]. TIs are mostly formulas or particular patterns, and the trading strategies that utilize TIs are referred to as technical analysis (TA) [16]. However, as pre-defined formulas and patterns cannot cover all market movements, it is getting harder and harder for TA to adapt to the fast-changing market. With the increase in computing power and available data, researchers have started to use deep learning (DL) to predict stock price movements. DL uses high-dimensional data to train complex and non-linear neural network models as trading strategies. Fortunately, DL's adaptability to the market is promisingly improved compared to TA. Recently, deep reinforcement learning (DRL) has emerged rapidly as the combination of DL and reinforcement learning (RL). By utilizing neural networks (NN), a DRL-based agent is particularly good at extracting useful information from high-dimensional data and taking sequential actions based on rewards. DRL methods have led to many breakthroughs in multiple fields. For instance, [11] successfully utilizes Deep Q-learning agents to learn directly from high-dimensional raw pixel input to play video games. Due to the sequential decision-making nature of financial investment, researchers naturally attempt to solve stock trading problems using DRL methods. [2] designed a cryptocurrency portfolio management (PM) framework using Deep Deterministic Policy Gradient (DDPG) [3,4], which is a model-free DRL algorithm. [5] proposes the Adversarial Training method to improve training efficiency using three different RL methods: DDPG, Proximal Policy Optimization (PPO) [6] and Policy Gradient (PG).
Although these approaches have presented potential performance, the data input of these approaches is still traditional historical data, namely opening-high-low-closing prices (OHLC) and trading volumes. Unlike preceding research, [7] proposes SARL, an RL framework that can incorporate heterogeneous data to generate PM strategies. Moreover, to address the challenge of balancing between exploration and exploitation, [13] proposes iRDPG for developing trading strategies by DRL and imitation learning. Multi-agent systems have also been proposed. In [10], the authors propose MAPS, a cooperative system containing multiple agents, to create diversified portfolios and to adapt to the continuously changing market conditions. However, while the existing approaches tackle PM problems with promising methods and techniques, these systems, with the strategies generated, are mostly fixed and ad-hoc. The existing systems or frameworks lack a modular design to be compatible with different trained RL agents. The RL agents trained for one portfolio can hardly be reused for different portfolios. These systems also lack scalability to accommodate the increasing number of assets and profundity of market information. In this paper, we propose MSPM for solving the problems.

Data acquisition
The historical price data used in this paper are QuoteMedia's End of Day US Stock Prices (EOD) [17] from Jan 2013 to Dec 2020, obtained using Nasdaq Data Link's API, which can be accessed by subscribing at: https://data.nasdaq.com/data/EOD-end-of-day-us-stock-prices. We also use web news sentiment data (FinSentS) [18] from Nasdaq Data Link, provided by InfoTrie, which can be accessed by subscribing at: https://data.nasdaq.com/databases/NS1/data.

Feature selection and data curation
We select the adjusted-close, open, high, and low prices and volume features from QuoteMedia's EOD data as the historical price data. We also select the sentiment and news_buzz features from InfoTrie's FinSentS Web News Sentiment. Each feature in EOD data is normalized by dividing by the first (day-one) value of that feature, and there is no missing value in any of these features.
For FinSentS data, we use the original values of the sentiment feature, and we fill the missing values (accounting for 9.51% of the total data) prior to year 2013 with a neutral sentiment: zero (0). Since the FinSentS data are not as straightforward as EOD data, we describe the selected FinSentS features in Table 1.
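The two curation steps described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' pipeline; the function names `normalize_by_first` and `fill_missing_sentiment`, the use of `None` to mark a missing value, and the toy numbers are all assumptions made for the example.

```python
from typing import List, Optional


def normalize_by_first(values: List[float]) -> List[float]:
    """Divide every value of a feature by its first (day-one) value."""
    first = values[0]
    if first == 0:
        raise ValueError("cannot normalize by a zero day-one value")
    return [v / first for v in values]


def fill_missing_sentiment(values: List[Optional[float]]) -> List[float]:
    """Replace missing sentiment scores (None) with a neutral 0.0."""
    return [0.0 if v is None else v for v in values]


adj_close = [50.0, 55.0, 45.0]        # toy prices, not real data
sentiment = [None, 1.5, None, -0.5]   # toy scores, not real data

print(normalize_by_first(adj_close))      # [1.0, 1.1, 0.9]
print(fill_missing_sentiment(sentiment))  # [0.0, 1.5, 0.0, -0.5]
```

Normalizing by the day-one value puts every price and volume series on a common scale starting at 1.0, which is convenient when one pre-trained module is later reused across assets with very different price levels.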

Methodology
Our MSPM system consists of two types of modules: EAM and SAM. The relationship between EAMs and SAMs is illustrated in Fig 1. Fig 2 illustrates an even more intuitive overview of MSPM's architecture. To accommodate MSPM in the sequential decision-making problem of financial portfolio management, we configured the specific settings for EAM and SAM. An

PLOS ONE
EAM contains a DQN agent and acts to generate signal-comprised information (historical prices with buy/closing/skip labels) for a designated asset. To train the agent in EAM, we constructed a sequential decision-making problem with the designated asset's historical prices and financial news as the state that the agent observes at each time step. A DQN agent acts to buy or close a position, or simply to skip, at every time step based on the latest prices and financial news data input, in order to maximize its total reward. The actions (signals) are then matched and stacked back onto the corresponding price data to form the signal-comprised information. EAM's architecture is illustrated in Fig 3. On the other hand, an SAM manages an investment portfolio and contains a PPO agent that reallocates the assets in that portfolio. SAMs are connected to multiple EAMs as an investment portfolio often has more than one asset. In the decision-making process of SAM, the state that the PPO agent observes at each time step is the combination of the signal-comprised information which the connected EAMs generate. Further, the PPO agent acts to generate the reallocation weights for the assets in the portfolio, which total up to 1.0. Fig 4 provides an overview of the SAM's architecture. For both EAM and SAM, the composition of the assets' historical prices and financial news or news sentiments is the environment their agents interact with. Each EAM is reusable. Once an EAM is set up and trained, it can be effortlessly connected to any SAM. An SAM connects to at least one EAM. EAMs are retrained periodically using the latest information from the market, media, financial institutions, etc., and we implemented the former two data sources in this study. In the following sections, we explain the technical details of EAM and SAM.

Fig 1. Overview of the surjection relationship between Evolving Agent Modules (EAMs) and Strategic Agent Modules (SAMs).
Each EAM is responsible for a single asset and employs a DQN agent, and it utilizes heterogeneous data to produce signal-comprised information. Each SAM is a module for a portfolio that employs a PPO agent to reallocate the assets using the stacked, signal-comprised 3-D tensor (the profound state $v^+$) from the connected EAMs. Moreover, trained EAMs are reusable for different portfolios and therefore can be combined and connected to any SAMs at will. By parallel computing, capital reallocation may be performed for various portfolios at scale simultaneously. https://doi.org/10.1371/journal.pone.0263689.g001

Evolving Agent Module (EAM)
State. At any given periodic (daily) time-step t, the agent in EAM observes state $v_t$, which consists of the designated asset's recent n-day historical prices $s_t$ and sentiment scores $\rho_t$. Specifically, $v_t = (s_t, \rho_t)$, where $s_t$ includes the designated asset's n-day close, open, high and low prices and volumes. $\rho_t$ includes the predicted and averaged news sentiments, obtained using a pre-trained FinBERT classifier [19,20] for asset-related financial news, which range continuously from -5.0 to 5.0, indicating bearishness (-5.0) or bullishness (5.0). Furthermore, $\rho_t$ also includes news_buzz. This attribute is an attempt to alleviate the unbalanced-news issue in the existing research [7]. Instead of restarting from the beginning after every episodic reset of the environment, the environment resets at a random time point of the data [21]. Because the news sentiments from FinSentS data and the sentiments generated by FinBERT are similar, and due to the restriction of APIs and web scraping, we only utilize FinSentS data as the sentiments input for the experiments in this paper.
Deep Q-network. For an EAM, we train a Deep Q-network (DQN) agent and follow the sequential decision-making of Deep Q-learning [11]. Deep Q-learning is a value-based method that derives a deterministic policy $\pi(\theta)$, which is a mapping $S \to A$ from state space to discrete action space. We use a Residual Network with 1-D convolution [22] to represent the state-value function $Q_\theta$, based on which the agent acts: $a_t = \arg\max_{a} Q_\theta(v_t, a)$. For information about model selection for EAM and hyperparameter tuning, see S1 Appendix.
DQN extensions. We implement three extensions [21] of the original DQN, namely dueling architecture [23], Double DQN [24] and two-step Bellman unrolling.

Fig 3. An EAM is a module for a designated asset. Each EAM takes two types of heterogeneous data: (1) the designated asset's historical prices and (2) asset-related financial news. At the center of an EAM is an extended DQN agent using a 1-D convolution ResNet for sequential decision making. Instead of training every EAM from scratch, we train EAMs by transfer learning using a foundational EAM. At every time step t, the DQN agent in EAM observes state $v_t$ of historical prices $s_t$ and news sentiments $\rho_t$ of the designated asset, and acts to trade with an action $a^{sig}_t$ of either buying, closing, or skipping, and eventually generates a 2-D signal-comprised tensor $s^{sc}_t$ using new prices $s_t$ and signals $a^{sig}_t$. https://doi.org/10.1371/journal.pone.0263689.g003
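As a rough illustration of how two of the DQN extensions named above combine, the sketch below computes a two-step Double DQN learning target in plain Python. It follows the standard textbook formulation of Double DQN [24] with n-step unrolling rather than the authors' code; the function name `two_step_double_dqn_target`, the default discount `gamma = 0.99`, the action ordering, and the toy Q-values are assumptions.

```python
from typing import Sequence


def two_step_double_dqn_target(
    r_t: float,
    r_t1: float,
    q_online_next: Sequence[float],
    q_target_next: Sequence[float],
    done: bool,
    gamma: float = 0.99,  # assumed discount factor
) -> float:
    """Target for Q(s_t, a_t) with two-step unrolling and Double DQN.

    q_online_next / q_target_next hold the Q-values of every action at
    s_{t+2}, as estimated by the online and the target network.
    """
    two_step_return = r_t + gamma * r_t1
    if done:  # the episode ended within the two steps: no bootstrap
        return two_step_return
    # Double DQN: the online network selects the action ...
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # ... and the target network evaluates it.
    return two_step_return + gamma ** 2 * q_target_next[best_action]


# Actions indexed as 0 = buy, 1 = close, 2 = skip (illustrative order).
target = two_step_double_dqn_target(
    r_t=1.0, r_t1=0.0,
    q_online_next=[0.2, 0.9, 0.1],
    q_target_next=[0.5, 0.4, 0.3],
    done=False, gamma=0.5,
)
print(target)  # 1.0 + 0.25 * 0.4 = 1.1
```

Separating action selection from action evaluation is what reduces the over-estimation bias of plain Q-learning, while the two-step return propagates reward information one step faster than the one-step target.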

Transfer learning. Instead of training every EAM from scratch, we initiate and train a foundational EAM, using historical prices of AAPL (Apple Inc.), and then train all other EAMs based on this pre-trained EAM. By doing so, the foundational EAM shares its parameters with the other EAMs, which thereby obtain prior knowledge of the pattern of stock trends. This transfer learning approach may help to tackle the data-shortage issue of newly-listed stocks, for which only limited historical prices and news data are available for training purposes.
Action. The DQN agent in EAM acts to trade the designated asset with an action of either buying, closing, or skipping, at every time step t. The chosen action, $a_t \in \{\text{buying}, \text{closing}, \text{skipping}\}$, is called an asset trading signal. As the action set indicates, there is no short (selling) position, and a new position is opened only after the existing position has been closed.
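The constraint described above (long-only, one open position at a time) can be sketched as a small state machine that pairs each buying signal with the next closing signal. This is an illustrative reading of the rule, not the authors' implementation; the names `Signal` and `match_positions`, and the choice to ignore a buying signal while a position is open or a closing signal when none is open, are assumptions.

```python
from enum import Enum
from typing import List, Optional, Tuple


class Signal(Enum):
    BUY = "buying"
    CLOSE = "closing"
    SKIP = "skipping"


def match_positions(signals: List[Signal]) -> List[Tuple[int, int]]:
    """Return (open_step, close_step) pairs for a long-only agent."""
    positions: List[Tuple[int, int]] = []
    opened_at: Optional[int] = None
    for t, signal in enumerate(signals):
        if signal is Signal.BUY and opened_at is None:
            opened_at = t                      # open a new long position
        elif signal is Signal.CLOSE and opened_at is not None:
            positions.append((opened_at, t))   # close the existing one
            opened_at = None
        # Everything else (SKIP, a BUY while already long, or a CLOSE
        # with nothing open) leaves the position state unchanged.
    return positions


signals = [Signal.BUY, Signal.SKIP, Signal.BUY, Signal.CLOSE,
           Signal.CLOSE, Signal.BUY, Signal.CLOSE]
print(match_positions(signals))  # [(0, 3), (5, 6)]
```

The resulting pairs are also the natural unit for counting opened positions and winning positions when the signals of a trained module are inspected.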
Reward. The reward, $r_t$, received by the DQN agent at each time step t is defined in terms of the following quantities: $v^{(close)}_t$ is the close price of the given asset at time step t, $t_l$ is the time step when a long position is opened and commissions are deducted, $\beta$ stands for the commission of 0.0025, and $\iota_t$ is the indicator of an open position (i.e., a position is still open).

Strategic Agent Module (SAM)
State (stacked signal-comprised tensor). Once EAMs have been trained, we feed new historical prices, $s_t$, and financial news of the designated assets to them to generate predictive trading signals $a^{sig}_t$. Then we stack the same new historical prices with $a^{sig}_t$ to form a 2-D signal-comprised tensor $s^{sc}_t$ as the data source to train SAM. Because an SAM is connected to multiple EAMs, the 2-D signal-comprised tensors from all connected EAMs are stacked and transformed into a 3-D signal-comprised tensor called the profound state $v^+_t$, which is the state that SAM observes at each time step t.
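The stacking described above can be sketched with nested lists. The axis order (assets, then days, then features plus signal), the integer encoding of signals, and the helper names `signal_comprised` and `profound_state` are assumptions chosen for illustration; the paper's actual tensor layout may differ.

```python
from typing import List

Matrix = List[List[float]]   # days x features


def signal_comprised(prices: Matrix, signals: List[int]) -> Matrix:
    """Append each day's trading signal to that day's price features."""
    if len(prices) != len(signals):
        raise ValueError("prices and signals must cover the same days")
    return [row + [float(sig)] for row, sig in zip(prices, signals)]


def profound_state(per_asset: List[Matrix]) -> List[Matrix]:
    """Stack the 2-D tensors of all connected EAMs into one 3-D tensor."""
    shapes = {(len(m), len(m[0])) for m in per_asset}
    if len(shapes) != 1:
        raise ValueError("all EAM outputs must share the same shape")
    return list(per_asset)   # assets x days x (features + signal)


# Toy data: 2 assets, 2 days, 2 price features; signals 0/1/2 stand for
# buy/close/skip (an assumed encoding).
aapl = signal_comprised([[1.00, 1.02], [1.01, 1.05]], [0, 2])
tsla = signal_comprised([[1.00, 0.98], [0.97, 1.03]], [2, 1])
state = profound_state([aapl, tsla])
print(len(state), len(state[0]), len(state[0][0]))  # 2 2 3
```

Because every module emits a tensor of the same shape, adding an asset to a portfolio only grows the first axis of the state.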
Proximal policy optimization. A PPO [6] agent is at the center of SAM to reallocate assets. PPO is an actor-critic style policy gradient method that has been widely used on continuous action space problems, due to its desirable performance and ease of implementation. A policy $\pi_\theta$ is a parametrized mapping $S \times A \to [0, 1]$ from state space to action space. Among the different objective functions of PPO, we implement the clipped surrogate objective [6], in which the advantage function $A^{\theta'}_t$ is expressed through the state-action value function $Q^{\theta'}(s_t, a_t)$ and the value function $V^{\theta'}(s_t)$. For the PPO agent, we design a policy network architecture targeting the uniqueness of continuous action space in financial portfolio management problems, inspired by the EIIE topology [2]. Because assets' reallocated weights at time step t are strictly required to total up to 1.0, we set $m^*$ normal distributions $N_1(\mu^1_t, \sigma), \ldots, N_{m^*}(\mu^{m^*}_t, \sigma)$, and we sample $x_t \in \mathbb{R}^{m^* \times 1}$ from the distributions, where $m^* = m + 1$, $\mu_t \in \mathbb{R}^{1 \times m^* \times 1}$ is the linear output of the last layer of the neural network, and the standard deviation is $\sigma = 0$. We eventually obtain the reallocation weights $a_t = \mathrm{Softmax}(x_t)$ and the log probability of $x_t$ for the PPO agent to learn. Fig 5 shows the details of the policy network (actor) of SAM, denoted by $\theta'$. Due to the resemblance and equivalence, architectures of the value network (critic) and target policy network, denoted by $\theta$, are not illustrated.
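For reference, the clipped surrogate objective of [6] and the quantities it depends on are usually written as below. This follows the standard PPO formulation; the assignment of $\theta'$ to the policy that collects the data and $\theta$ to the policy being optimized, the clipping range $\epsilon$, the discount factor $\gamma$, and the use of $r^*$ for the per-step reward are assumptions about notation, not a reproduction of the original typeset equations.

```latex
\begin{aligned}
L^{\mathrm{CLIP}}(\theta) &= \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,A^{\theta'}_t,\ \operatorname{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)A^{\theta'}_t\right)\right],\\
r_t(\theta) &= \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta'}(a_t \mid s_t)},\\
A^{\theta'}_t &= Q^{\theta'}(s_t, a_t) - V^{\theta'}(s_t),\\
Q^{\theta'}(s_t, a_t) &= \mathbb{E}_{s_{t+1},\,a_{t+1},\,\ldots}\!\left[\sum_{l=0}^{\infty} \gamma^{l}\, r^{*}_{t+l}\right],\\
V^{\theta'}(s_t) &= \mathbb{E}_{a_t,\,s_{t+1},\,\ldots}\!\left[\sum_{l=0}^{\infty} \gamma^{l}\, r^{*}_{t+l}\right].
\end{aligned}
```

The clip term removes the incentive to move the probability ratio $r_t(\theta)$ outside $[1-\epsilon, 1+\epsilon]$, which keeps each policy update close to the policy that generated the data.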
Action. The action the PPO agent takes at each time step t is $a_t = (a_{1,t}, \ldots, a_{m^*,t})^\top$, which is the vector of reallocating weights at each time step t, with $\sum_{i=1}^{m^*} a_{i,t} = 1$. Once the assets are reallocated by $a_t$, the allocation weights of the portfolio eventually become $w_t$ as asset prices move according to the relative price vector, that is, the changes of asset prices over time, including the prices of assets and cash. $v^{+(close)}_{i,t}$ denotes the closing price of the i-th asset at time t, where $i = \{2, \ldots, m^*\}$, excluding cash (risk-free asset) whose closing price should always be 1.
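The step from a sampled vector to the reallocation weights, and from there to the drifted weights and the transaction cost, can be sketched as follows. The softmax and the cost term $\beta \sum_i |a_{i,t} - w_{i,t}|$ come from the text; the drift formula $w = (y \odot a)/(y \cdot a)$ is the convention of [2] and is an assumption here, as are the symbol `y` for the relative price vector, the function names, and the toy numbers.

```python
import math
from typing import List


def softmax(x: List[float]) -> List[float]:
    """Map an unconstrained vector to weights that sum to 1.0."""
    shift = max(x)                       # subtract the max for stability
    exps = [math.exp(v - shift) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def drifted_weights(a: List[float], y: List[float]) -> List[float]:
    """Weights after prices move by the relative price vector y.

    Assumed convention from [2]: w = (y * a) / (y . a), with the first
    entry being cash, whose relative price is always 1.
    """
    grown = [yi * ai for yi, ai in zip(y, a)]
    total = sum(grown)
    return [g / total for g in grown]


def transaction_cost(a: List[float], w: List[float], beta: float = 0.0025) -> float:
    """Commission beta * sum_i |a_i - w_i| charged on the traded fraction."""
    return beta * sum(abs(ai - wi) for ai, wi in zip(a, w))


x_t = [0.0, 1.0, -1.0]            # toy network sample: cash + two assets
a_t = softmax(x_t)
y_t = [1.0, 1.10, 0.95]           # toy relative prices, cash fixed at 1
w_t = drifted_weights(a_t, y_t)

print(round(sum(a_t), 10), round(sum(w_t), 10))               # 1.0 1.0
print(transaction_cost([0.5, 0.5, 0.0], [0.25, 0.25, 0.5]))   # 0.0025
```

Applying the softmax to the sampled vector, rather than sampling weights directly, is what guarantees that the weights are non-negative and total 1.0 regardless of the network output.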
Reward. Inspired by [2] in which the agent maximizes the sum of the logarithmic value, and [5] in which the authors try to cluster the periodic portfolio risk to alleviate the biases in training data and to prevent exposure to highly-volatile assets, we set the reward to be a risk-adjusted rate of return, $r^*_t$, which the PPO agent receives at each time step t. In its definition, $m^*$ is the number of assets, $w_t$ represents the allocation weights of the assets at the end of a period, $\beta \sum_{i=1}^{m^*} |a_{i,t} - w_{i,t}|$ is the transaction cost, where $\beta = 0.0025$ is the commission rate, and $\varphi = 0.001$ is the risk discount, which can be fine-tuned as a hyperparameter. The reward further involves a term measuring the volatility of fluctuation in assets' prices during the last n days, together with the volatility of the profit of an individual asset. We expect the agent to secure a maximum risk-adjusted rate of return (capital gain) at every time step, as is expected from human portfolio managers.

Experiments
In this section, we build different portfolios, and train MSPM to periodically reallocate the assets in each portfolio. The portfolios, datasets, and performance metrics for benchmarking will be introduced and described. After that, we explain and discuss the experimental results and examine MSPM's stability of daily rate of return. We also inspect the signal generation and position-holding of EAMs. In the end, we validate the necessity of EAM by back-testing four different portfolios. The back-testing performance of MSPM will be compared with the existing baselines. Additionally, the two SAMs shared the same EAM for the stock in common: Alphabet (GOOGL). Later, we propose two other portfolios (c) and (d), which make four portfolios in total, to validate the necessity of EAM. Details can be found in the Validation of EAM section. For all four portfolios, we set the initial portfolio value to be $p_0 = 10{,}000$.

Preliminaries
Data ranges. Among the EAMs to be trained, the foundational EAM (AAPL) is trained initially, and its parameters are shared with other EAMs as their foundation for transfer learning. As shown in Table 2, EAM-training data, ranging from January 2009 to December 2015, contains the historical prices ($s_t$) and news sentiments ($\rho_t$) of the stocks, including AAPL, in portfolios (a) and (b). EAM-predicting data, with the same data structure as EAM-training and ranging from January 2016 to December 2020, is used for EAMs to predict and generate trading signals (actions of DQN agents). Then, EAM-predicting data along with the generated trading signals become the signal-comprised data for SAM/MSPMs. There are three datasets of signal-comprised data: SAM/MSPM-training and SAM/MSPM-validating to train and validate SAMs, respectively; and SAM/MSPM-experiment, from January 2020 to December 2020, for back-testing and other experiments. Details can be found in Table 2. It is worth noting that the low percentage (9.51%) of missing values in the alternative data (sentiments) should affect neither MSPM's scalability nor its reusability since, as a general framework, MSPM is neutral on the structures, types, or sources of the data input.
Performance metrics. We use the following performance metrics to measure the performances of the baselines and MSPM system.
• Daily Rate of Return (DRR): DRR is computed over all time steps up to the terminal time step T from $R_t$, the risk-unadjusted periodic (daily) rate of return obtained at every time step, where $\beta \sum_{i=1}^{m^*} |a_{i,t} - w_{i,t}|$ is the transaction cost and $\beta = 0.0025$ is the commission rate.
• Accumulated rate of return (ARR): ARR [26] is computed from $p_0$, the portfolio value at the initial time step, and $p_T$, the portfolio value at the terminal time step T.
• Sortino ratio (SR): Sortino ratio [27] is often referred to as a risk-adjusted return, which measures the portfolio performance compared to a risk-free return, adjusted by the portfolio's downside risk. In our case, Sortino ratio is calculated from $R_t$, the risk-unadjusted periodic (daily) rate of return, and the portfolio's downside risk $\sigma_{downside}$. The downside risk is calculated as a square root over the returns $R_l$ measured against $R_f$, where $R_f$ is the risk-free return and conventionally equals zero, $R_l$ are the less-than-zero returns in $R_t$ for all t, and $t = T$ is the terminal time step.
• Max drawdown (MD) MD is the biggest drop (in %) between the highest (peak) and lowest (valley) of the accumulated rate of return of a certain period of time.
For DRR, ARR and SR, we want them to be as high as possible, whereas we want MD to be as low as possible.
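The four metrics listed above can be sketched in plain Python using their common textbook forms. Several choices here are assumptions rather than the paper's exact definitions: DRR as the arithmetic mean of daily returns, ARR as $p_T / p_0 - 1$, the downside deviation averaged over all T time steps, and MD measured on portfolio values. The function names and toy values are likewise hypothetical.

```python
import math
from typing import List


def daily_rate_of_return(returns: List[float]) -> float:
    """Assumed form: arithmetic mean of the daily returns R_t."""
    return sum(returns) / len(returns)


def accumulated_rate_of_return(values: List[float]) -> float:
    """Assumed form: p_T / p_0 - 1 for portfolio values p_0, ..., p_T."""
    return values[-1] / values[0] - 1.0


def sortino_ratio(returns: List[float], risk_free: float = 0.0) -> float:
    """Mean excess return divided by the downside deviation."""
    downside = [r - risk_free for r in returns if r < risk_free]
    if not downside:
        return math.inf   # no losing day in the sample
    sigma_down = math.sqrt(sum(d * d for d in downside) / len(returns))
    return (daily_rate_of_return(returns) - risk_free) / sigma_down


def max_drawdown(values: List[float]) -> float:
    """Largest peak-to-valley drop, as a fraction of the peak."""
    peak, worst = values[0], 0.0
    for v in values:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst


values = [10_000.0, 11_000.0, 8_800.0, 12_000.0]   # toy portfolio values
returns = [b / a - 1.0 for a, b in zip(values, values[1:])]
print(round(accumulated_rate_of_return(values), 4))  # 0.2
print(round(max_drawdown(values), 4))                # 0.2
```

Because the Sortino ratio penalizes only negative returns, a strategy with large upside swings can score well on SR while still showing a high max drawdown.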

Back-testing performance
We back-test and compare the performance of our MSPM system to different baselines, including the traditional and cutting-edge RL-based portfolio management strategies [28,29]. The baselines are listed as follows:
• CRP stands for (Uniform) Constant Rebalanced Portfolio, which involves investing an equal proportion of capital in each asset, namely 1/N. It seems simple but is, in fact, challenging to beat [14].
• Buy and hold (BAH) strategy involves investing without rebalancing. Once the capital is invested, no further allocation will be made.
• Exponential gradient portfolio (EG) strategy involves investing capital into the latest stock with the best performance and uses a regularization term to maintain the portfolio information.
• Follow the regularized leader (FTRL) strategy tracks the Best Constant Rebalanced Portfolio until the previous period, with an additional regularization term. This strategy reweights based on the entire history of the data with an expectation to obtain maximum returns.
• ARL refers to the adversarial deep reinforcement learning in portfolio management (Adversarial PG) [5], which is a state-of-the-art (SOTA) RL-based portfolio management method.
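To make the difference between the two simplest baselines above concrete, the sketch below back-tests uniform CRP and BAH on toy relative prices. It ignores transaction costs and assumes that BAH starts from an equal split; the function names `crp_value` and `bah_value` and the price data are hypothetical, and only the initial value of 10,000 is taken from the text.

```python
from typing import List


def crp_value(relative_prices: List[List[float]], p0: float = 10_000.0) -> float:
    """Uniform CRP: rebalance to 1/N before every period (costs ignored)."""
    value = p0
    for y in relative_prices:
        value *= sum(y) / len(y)
    return value


def bah_value(relative_prices: List[List[float]], p0: float = 10_000.0) -> float:
    """Buy and hold: split p0 equally once, then never rebalance."""
    n = len(relative_prices[0])
    holdings = [p0 / n] * n
    for y in relative_prices:
        holdings = [h * yi for h, yi in zip(holdings, y)]
    return sum(holdings)


# Two assets that double and halve in turn (toy data).
periods = [[2.0, 0.5], [0.5, 2.0]]
print(crp_value(periods), bah_value(periods))  # 15625.0 10000.0
```

In this oscillating toy market, rebalancing harvests the swings while buy-and-hold ends where it started, which hints at why CRP is hard to beat in volatile conditions.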
As shown in Figs 7 and 8, for both portfolios (a) and (b), MSPM system improves ARR, by at least 49.3% and 426.6% compared to ARL, a SOTA RL-based PM method, and by 186.5% and 369.8% compared to CRP, a traditional PM strategy, during the year of 2020. The result demonstrates the advantage of MSPM at gaining capital returns. Table 3 gives details about MSPM's outperformance over existing baselines in terms of the ARR and DRR. Further, MSPM's superior performance on SR indicates that MSPM takes better consideration of harmful volatility and achieves higher risk-adjusted returns.
It is worth noting that for portfolio (a), both MSPM and ARL achieve promising SR, but for portfolio (b), MSPM has a much better Sortino ratio than ARL, which indicates MSPM's higher adaptability to the ever-changing market compared to not only the traditional strategies but also the preceding RL-based method.

Stability of daily rate of return (DRR)
Due to the high max drawdown (MD) of MSPM for portfolio(b) (60.6%), we want to examine and compare the general stability of DRR between MSPM and the state-of-the-art RL-based method: ARL. For this purpose, we first calculate DRR's 5-day rolling standard deviation (RstdDRR) as the proxy of the stability of DRR. Higher RstdDRR indicates lower stability of DRR.
To calculate the RstdDRR, we first calculate the simple moving average (SMA) [30] of $\mathrm{DRR} \in \mathbb{R}^k$ for the past n data-points (days): $\mathrm{SMA}_i = \frac{1}{n} \sum_{j=i-n+1}^{i} \mathrm{DRR}_j$ for $i = n, \ldots, k$. Then, we subtract $\mathrm{SMA}_i$ from the 5-day DRRs used in the calculation, and take the square root of the sum of the squared differences to obtain the rolling standard deviation $\mathrm{RstdDRR} \in \mathbb{R}^{k-n}$, where $i = n, \ldots, k$.
Since the histograms in Figs 9 and 10 show skewed bell shapes, we use the Shapiro-Wilk test [31] to check the normality of the distributions. After that, we use Levene's test [32] to examine the variance equality. We use Python's SciPy library to perform these two tests. By implementing the Shapiro-Wilk test, we find that MSPM's and ARL's RstdDRR are not from normal distributions for both portfolios (p-values are less than 0.05). Moreover, according to Levene's test, MSPM's and ARL's RstdDRR do not always have homogeneity of variance: for portfolio (a) they do, whereas for portfolio (b) they do not. With the assumptions checked, we perform the one-tailed, two-sample Mann-Whitney U test [33] (a non-parametric version of the unpaired t-test) to rigorously compare MSPM's and ARL's stability of DRR, also using Python's SciPy library.
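The rolling standard deviation described above can be sketched in plain Python. The normalization is an assumption: the sketch divides the squared deviations by n (population form) before taking the root, which the text does not spell out. The function name `rolling_std` and the toy values are hypothetical.

```python
import math
from typing import List


def rolling_std(drr: List[float], n: int = 5) -> List[float]:
    """Rolling standard deviation of DRR over windows of n days.

    Assumption: population form, i.e. the squared deviations from the
    window's simple moving average are divided by n before the root.
    """
    if n <= 0 or n > len(drr):
        raise ValueError("window size must be in 1..len(drr)")
    out: List[float] = []
    for i in range(n, len(drr) + 1):
        window = drr[i - n:i]
        sma = sum(window) / n                       # simple moving average
        squared = sum((d - sma) ** 2 for d in window)
        out.append(math.sqrt(squared / n))
    return out


toy_drr = [0.01, -0.02, 0.03, 0.00, 0.01, 0.05]   # toy values, not real data
print(len(rolling_std(toy_drr)))                  # 2 windows of 5 days
print(rolling_std([1.0, 3.0], n=2))               # [1.0]
```

The two resulting series (one per method) are the kind of input accepted by `scipy.stats.shapiro`, `scipy.stats.levene`, and `scipy.stats.mannwhitneyu`, the last with its `alternative` argument set to `'less'` or `'greater'` for a one-tailed test.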
For portfolio (a), because the mean RstdDRR of MSPM is less than the mean RstdDRR of ARL, the null hypothesis $H_0$ is that MSPM has lower or the same stability as ARL (the group mean of RstdDRR of MSPM is greater than or equal to that of ARL), and the alternative hypothesis $H_a$ is that MSPM has higher stability than ARL (the group mean of RstdDRR of MSPM is less than that of ARL). For portfolio (b), because the mean RstdDRR of MSPM is higher than the mean RstdDRR of ARL, the null hypothesis $H_0$ is that MSPM has higher or the same stability as ARL (the group mean of RstdDRR of MSPM is less than or equal to that of ARL), and the alternative hypothesis $H_a$ is that MSPM has lower stability than ARL (the group mean of RstdDRR of MSPM is greater than that of ARL). We set the significance level to be .05. If the p-value from the test is less than 0.05, we reject $H_0$ and accept $H_a$; otherwise, we accept the null hypothesis $H_0$. The detailed settings of the statistical test are:
• Significance level: .05
As shown in Table 4, MSPM has significantly higher stability of DRR than ARL for portfolio (a), since $H_0$ is rejected and $H_a$ accepted ($U_a = 25426.0$, p-value = .005). For portfolio (b), because $H_0$ is likewise rejected and $H_a$ accepted ($U_b = 16209.0$, p-value < .001), we confirm that MSPM has lower stability of DRR than ARL. The conclusions are aligned with the MD in Table 3 and the underwater plots in S3-S6 Figs, which illustrate the drawdowns during year 2020. It is clear in S3 and S4 Figs that ARL has more frequent and intensive drawdowns for portfolio (a) compared to MSPM, but MSPM becomes the more volatile one for portfolio (b) according to S5 and S6 Figs. The results indicate that although MSPM achieves an outstanding performance in gaining capital returns, it does not naturally come with higher stability. However, low stability (or high risk) does not necessarily imply danger. Since for both portfolios (a) and (b), MSPM has the highest Sortino ratios, which consider only the downside risk, MSPM's lower stability for portfolio (b) may come from a higher upside risk. In conclusion, there should be a trade-off between performance and stability, and this can be further investigated and considered in future studies.

EAM: Case study
To better understand how EAM contributes to SAM, we illustrate the position-holding information using the signals generated by the EAMs of portfolio

closing prices, we color the period as light green (winning position), otherwise light red. Periods with no position are left blank. According to the results illustrated in the figures, the corresponding EAMs open and close positions at appropriate times for most assets.

As shown in Table 5, every EAM opens fewer than ten positions; the highest count is eight, for both NVDA and TSLA. The most profitable EAM is TSLA, with an ARR of 799%. These results exemplify the high quality and reliability of the signals generated by the EAMs. The winning rates of all five EAMs are above 50%. The averaged winning rate is 80%, which indicates that even with a mediocre averaged winning rate, SAM can still efficiently utilize the information generated by the EAMs and outperform ARL. The results also indicate that MSPM can perform even better if the winning rate of the EAMs is improved.
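To make the per-EAM statistics concrete, the following sketch counts positions and computes a winning rate from a signal series. It rests on assumptions that are not taken from the paper: the signals are encoded as the strings "buy", "close", and "skip", positions are long-only, and a position counts as winning when the closing price at exit exceeds the closing price at entry. The price and signal series are invented for illustration.

```python
def extract_positions(signals, closes):
    """Pair each 'buy' (when flat) with the next 'close' into (entry, exit) prices."""
    positions, entry = [], None
    for signal, price in zip(signals, closes):
        if signal == "buy" and entry is None:
            entry = price  # open a position at today's closing price
        elif signal == "close" and entry is not None:
            positions.append((entry, price))  # close it at today's closing price
            entry = None
    return positions  # a position still open at the end is ignored


def summarize(positions):
    """Number of positions, winning rate, and compounded return over all positions."""
    wins = sum(1 for entry, exit_ in positions if exit_ > entry)
    growth = 1.0
    for entry, exit_ in positions:
        growth *= exit_ / entry
    return {
        "positions": len(positions),
        "winning_rate": wins / len(positions) if positions else 0.0,
        "accumulated_return": growth - 1.0,
    }


# Hypothetical signals and closing prices for a single asset.
signals = ["skip", "buy", "skip", "close", "buy", "close", "skip"]
closes = [10.0, 10.0, 12.0, 12.0, 11.0, 9.9, 9.0]

stats = summarize(extract_positions(signals, closes))
```

In this toy series the first position wins and the second loses, so the winning rate is one half even though the compounded return stays positive, which shows why the two columns of such a table can diverge.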

The results validate that the SAMs achieve ideal performance only with the trading-signal-comprised information from the EAMs.

Discussion on scalability and reusability of MSPM
To address the issue of inefficient model training in RL-based PM, EAMs are designed to be independent and reusable. Once an EAM has been trained, it can be connected to any SAM without retraining. For example, in the previous sections, portfolio (a) and portfolio (b) share one EAM, GOOGL, which saves the time and resources that redundant model training would otherwise require.
On the other hand, to address the issues of ad-hoc and fixed model training in RL-based PM, MSPM allows the number of EAMs connected to any single SAM to be scaled up. In the EAM: Case study section, each EAM represents a single asset, and because these EAMs are already trained, they are ready to be connected to any SAM. For example, to build a portfolio containing two assets, e.g., AAPL and TSLA, we connect the corresponding two EAMs to an SAM and train the SAM to build the portfolio. Meanwhile, the remaining EAMs can be used in other portfolios. If we later want to scale this portfolio up to four assets, we simply add two more EAMs, e.g., GOOGL and NVDA, to the SAM without spending time training the EAMs again. Although the SAM needs to be retrained once the portfolio is scaled up, the benefits brought by the EAMs are considerable, since the previous section validated that the performance of an EAM-enabled SAM is largely improved compared to an EAM-disabled SAM. Moreover, MSPM's scalability allows EAMs to accommodate the need for heterogeneous and alternative data input, such as the sentiment data utilized in our research. Therefore, with MSPM's scalability and reusability for creating dynamic and adaptive portfolios, researchers and portfolio managers can use parallel computing to perform capital reallocation simultaneously for various portfolios covering a large number of assets.
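The assembly process described above can be illustrated with a small structural sketch. It is not the authors' implementation: the class and method names are hypothetical, `momentum_stub` merely stands in for a trained DQN agent, and `train` is a placeholder where PPO training of the SAM would happen. The sketch only shows that trained EAMs are shared objects that can be attached to several SAMs, while a SAM is flagged for retraining when its set of assets changes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass
class EAM:
    """Asset-dedicated module; `policy` stands in for a trained DQN agent."""
    asset: str
    policy: Callable[[Sequence[float]], int]

    def signal(self, prices: Sequence[float]) -> int:
        return self.policy(prices)


@dataclass
class SAM:
    """Decision-making module; tracks which asset set it was last trained on."""
    eams: List[EAM] = field(default_factory=list)
    trained_on: Tuple[str, ...] = ()

    def connect(self, eam: EAM) -> None:
        self.eams.append(eam)  # the EAM itself is reused, not retrained

    def assets(self) -> Tuple[str, ...]:
        return tuple(eam.asset for eam in self.eams)

    def train(self) -> None:
        self.trained_on = self.assets()  # placeholder for PPO training

    def needs_retraining(self) -> bool:
        return self.trained_on != self.assets()

    def stacked_signals(self, market: Dict[str, Sequence[float]]) -> List[int]:
        return [eam.signal(market[eam.asset]) for eam in self.eams]


def momentum_stub(prices: Sequence[float]) -> int:
    """Hypothetical stand-in policy: 1 if the last price exceeds the first."""
    return 1 if prices[-1] > prices[0] else 0


# A pool of already-trained EAMs, one per asset.
pool = {name: EAM(name, momentum_stub)
        for name in ("AAPL", "TSLA", "GOOGL", "NVDA")}

# Two-asset portfolio: connect two EAMs, then train the SAM.
portfolio = SAM()
portfolio.connect(pool["AAPL"])
portfolio.connect(pool["TSLA"])
portfolio.train()

# Scale up to four assets: the EAMs are reused, only the SAM is retrained.
portfolio.connect(pool["GOOGL"])
portfolio.connect(pool["NVDA"])
retrain_needed = portfolio.needs_retraining()

# A second SAM can share the same GOOGL EAM.
other = SAM()
other.connect(pool["GOOGL"])

market = {"AAPL": [1.0, 1.2], "TSLA": [3.0, 2.5],
          "GOOGL": [2.0, 2.1], "NVDA": [5.0, 5.0]}
signals = portfolio.stacked_signals(market)
```

Keeping each EAM as a standalone object keyed by asset is what makes the reuse cheap in this sketch: scaling a portfolio changes only the list held by the SAM.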

Limitations and future work
In this paper, to apply MSPM to the sequential decision-making problems of PM, we implement only DQN and PPO as the agents in the EAM and SAM modules, and we leave the implementation of other algorithms in MSPM to future studies. Additionally, the trade-off between the stability of DRR and the performance metrics (ARR, DRR, or SR) may be considered when designing the reward functions in future studies. We use only historical prices and sentiment data in this research, and we plan to utilize more heterogeneous data, e.g., satellite images, in future studies.

Conclusion
We propose MSPM, a modularized multi-agent RL-based system, to bring scalability and reusability to financial portfolio management. We design and develop two types of modules in

MSPM: EAM and SAM. EAM is an asset-dedicated module that takes heterogeneous data and utilizes a DQN-based agent to generate signal-comprised information. SAM, on the other hand, is a decision-making module that receives stacked information from the connected EAMs to reallocate the assets in a portfolio. Because EAMs can be combined and connected to any SAM at will, this modularized and reusable design allows MSPM to address the issue of ad-hoc, fixed, and inefficient model training in existing RL-based methods. Through experiments, we confirm that MSPM outperforms various baselines in terms of accumulated rate of return, daily rate of return, and Sortino ratio. Additionally, to exemplify the high quality and reliability of the signals generated by EAM, we inspect the position-holding of five different EAMs. Furthermore, we validate the necessity of EAM by back-testing and comparing the EAM-enabled and EAM-disabled MSPMs on four different portfolios. The experimental results suggest that MSPM can serve as a stepping stone to inspire more creative system designs in reinforcement learning-based financial portfolio management.

Validation: Zhenhan Huang.